GitHub: https://github.com/pavanchavda/PC-ANLY506/tree/master/Code

Introduction

The purpose of this research project is to analyze the relationship between income, life expectancy, and population for countries around the world using the Gapminder dataset. I will use the exploratory data analysis techniques learned in the ANLY 506 course to explore the data, clean it, and create visualizations that help answer the research questions posed below. For the visualizations, I will use a combination of a boxplot, an interactive scatterplot, a treemap, etc. to explore the data visually.

Research Questions

The focus of this research project will be on these questions:

Data Source

The data used in this study was compiled by the Gapminder Foundation. The Gapminder Foundation is a non-profit organization that promotes sustainable global development and achievement of the United Nations Millennium Development Goals by increased use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels [Gapminder Wikipedia]. Our dataset contains a total of 41,284 records and has 6 variables: country, year, life (life expectancy), population, income (GDP per capita), and region. The dataset covers the years 1800 to 2015.

Exploratory Data Analysis

Before we dive into the results and start creating visualizations, let’s first explore the data so that we can familiarize ourselves with its structure and ensure that the data we are using is accurate and of good quality. If the data is inaccurate, the results will be too.

1. Run str()

We will run the str() function on our data to review its structure. The str() function reveals that the population variable is stored as a factor and the year variable as an integer. Let’s convert population to numeric and year to a factor. I also noticed that some variable names start with a lowercase letter and others with an uppercase letter. For consistency, I updated the variable names so that they all start with an uppercase letter.

str(data)
## 'data.frame':    41284 obs. of  6 variables:
##  $ Country   : Factor w/ 197 levels "Åland","Afghanistan",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Year      : int  1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ...
##  $ life      : num  28.2 28.2 28.2 28.2 28.2 ...
##  $ population: num  3280000 NA NA NA NA NA NA NA NA NA ...
##  $ income    : int  603 603 603 603 603 603 603 603 603 603 ...
##  $ region    : Factor w/ 6 levels "America","East Asia & Pacific",..: 5 5 5 5 5 5 5 5 5 5 ...
#Change the datatype for population and Year column
data$population <- as.numeric(as.character(data$population))
data$Year <- as.factor(data$Year)

#Update column names for variables
colnames(data)<- c("Country", "Year", "LifeExpectancy", "Population", "Income", "Region")
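
The reason for going through as.character() before as.numeric() can be shown with a small sketch. The toy factor below is made up for illustration and is not taken from the Gapminder data; it only shows that converting a factor directly returns its internal level codes rather than the values.

```r
# Hypothetical toy factor standing in for a numeric column that was read in as text
pop_factor <- factor(c("3280000", "24834", "3280000"))

# Direct conversion returns the internal level codes, not the stored values
codes <- as.numeric(pop_factor)

# Converting to character first recovers the actual numbers
values <- as.numeric(as.character(pop_factor))

print(codes)
print(values)
```

If the raw text contained thousands separators or other non-numeric characters, as.numeric() would return NA for those entries, so it is worth checking the NA count before and after the conversion.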

2. Looking at the top and the bottom of your data

The top and bottom of the data reveal no immediate concerns about the quality of the data.

#Check top few rows of the data
head(data)
##       Country Year LifeExpectancy Population Income     Region
## 1 Afghanistan 1800       28.21100    3280000    603 South Asia
## 2 Afghanistan 1801       28.20075         NA    603 South Asia
## 3 Afghanistan 1802       28.19051         NA    603 South Asia
## 4 Afghanistan 1803       28.18026         NA    603 South Asia
## 5 Afghanistan 1804       28.17001         NA    603 South Asia
## 6 Afghanistan 1805       28.15977         NA    603 South Asia
#Check bottom few rows of the data
tail(data)
##       Country Year LifeExpectancy Population Income                Region
## 41279  Åland 1992          80.83      24834     NA Europe & Central Asia
## 41280  Åland 1993          81.80      24950     NA Europe & Central Asia
## 41281  Åland 1994          80.63      25066     NA Europe & Central Asia
## 41282  Åland 1995          79.88      25183     NA Europe & Central Asia
## 41283  Åland 1996          80.00      25301     NA Europe & Central Asia
## 41284  Åland 1997          80.10      25419     NA Europe & Central Asia

3. Check the packaging

The number of rows and columns matches what I had expected.

#Check number of rows in our data
nrow(data)
## [1] 41284
#Check number of columns in our data
ncol(data)
## [1] 6

4. Check the “n”s

There are 197 unique countries in our dataset. There are currently 195 countries in the world, but our dataset goes back to the year 1800, so it may include countries that no longer exist. The countries are divided into 6 regions: America, East Asia & Pacific, Europe & Central Asia, Middle East & North Africa, South Asia, and Sub-Saharan Africa.

#Counting the number of distinct countries in our data using length and unique
length(unique(data$Country))
## [1] 197
#Checking the frequency of regions in the data
table(data$Region)
## 
##                    America        East Asia & Pacific 
##                       7961                       6256 
##      Europe & Central Asia Middle East & North Africa 
##                      10468                       4309 
##                 South Asia         Sub-Saharan Africa 
##                       1728                      10562

The summary of the data shows that the population variable contains 25,817 NA’s and the income variable contains 2,341 NA’s. A quick look at the data shows that the NA’s in the population variable arise because population is only available every 10 years until 1950. For the NA’s in the income variable, let’s examine further to see which countries are missing data.

#Summarize the data to obtain descriptive statistics
summary(data)
##                 Country           Year       LifeExpectancy 
##  Afghanistan        :  216   1997   :  197   Min.   : 1.00  
##  Albania            :  216   1988   :  196   1st Qu.:31.00  
##  Algeria            :  216   1989   :  196   Median :35.12  
##  Angola             :  216   1990   :  196   Mean   :42.88  
##  Antigua and Barbuda:  216   1991   :  196   3rd Qu.:55.60  
##  Argentina          :  216   1992   :  196   Max.   :84.10  
##  (Other)            :39988   (Other):40107                  
##    Population            Income                              Region     
##  Min.   :1.548e+03   Min.   :   142   America                   : 7961  
##  1st Qu.:5.335e+05   1st Qu.:   883   East Asia & Pacific       : 6256  
##  Median :3.358e+06   Median :  1450   Europe & Central Asia     :10468  
##  Mean   :2.119e+07   Mean   :  4571   Middle East & North Africa: 4309  
##  3rd Qu.:1.078e+07   3rd Qu.:  3483   South Asia                : 1728  
##  Max.   :1.376e+09   Max.   :182668   Sub-Saharan Africa        :10562  
##  NA's   :25817       NA's   :2341
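
The NA counts reported by summary() can also be pulled out directly, which is handy when there are many columns. This is a minimal sketch on a made-up toy data frame (the values are hypothetical, not Gapminder records), using only base R.

```r
# Hypothetical toy data frame with a few missing values
toy <- data.frame(
  Country    = c("A", "A", "B", "B"),
  Population = c(1000, NA, NA, 2500),
  Income     = c(600, 610, NA, 900)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Share of missing values in each column
na_share <- colMeans(is.na(toy))

print(na_counts)
print(na_share)
```

Applied to the real data, colSums(is.na(data)) should agree with the NA’s rows at the bottom of the summary() output.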

The table below shows the number of years of missing income data by country (the other columns give the same count for the remaining variables). Of the 15 countries with missing income data, only Croatia is a major country with a somewhat significant population. The rest are very small in terms of overall population and will not skew our analysis.

#Create a table that shows, per country, the number of years with missing data
#funs() is deprecated in recent dplyr releases, so a lambda inside across() is used instead
kable(data %>%
    group_by(Country) %>%
    summarise(across(everything(), ~ sum(is.na(.x)))) %>%
    filter(Income > 0),
  format = "html", padding = 2, table.attr = "id=\"mytable\"")
Country                 Year  LifeExpectancy  Population  Income  Region
Åland                      0               0           0      10       0
Channel Islands            0               0           0      55       0
Croatia                    0               0         135      20       0
French Guiana              0               0         135     205       0
French Polynesia           0               0         135     205       0
Guadeloupe                 0               0         135     205       0
Guam                       0               0         135     205       0
Martinique                 0               0         135     205       0
Mayotte                    0               0         135     205       0
Netherlands Antilles       0               0         150     205       0
New Caledonia              0               0         135     205       0
Reunion                    0               0         135     205       0
Tokelau                    0               0           0       1       0
Virgin Islands (U.S.)      0               0         135     205       0
Western Sahara             0               0         135     205       0
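
For readers without dplyr, the same per-country count of missing income values can be sketched in base R. The toy data frame is hypothetical and only illustrates the split-and-count idea.

```r
# Hypothetical toy data: country B is missing income for two of its three years
toy <- data.frame(
  Country = c("A", "A", "A", "B", "B", "B"),
  Year    = c(2013, 2014, 2015, 2013, 2014, 2015),
  Income  = c(600, 610, 620, NA, NA, 900)
)

# Count the missing income values within each country
missing_by_country <- vapply(split(is.na(toy$Income), toy$Country), sum, integer(1))

# Keep only the countries that have at least one missing value
missing_by_country <- missing_by_country[missing_by_country > 0]

print(missing_by_country)
```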

Data Visualization

So far, we have cleaned up the data and used exploratory data analysis techniques to familiarize ourselves with it. Now the fun part begins: we can start exploring the data visually.

While there is a vast amount of data available to analyze, for the purpose of this study I will only analyze the data for the most recent year available, 2015. So let’s first create a new dataset containing only the 2015 data.

#Filter data for year 2015 and assign it to a new variable called data2015
data2015 <- data %>%
  filter(Year==2015)
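
Because Year was converted to a factor earlier, it is worth seeing how the comparison behaves. The sketch below uses a made-up toy data frame and base R subsetting; it assumes the year labels are plain numbers stored as factor levels.

```r
# Hypothetical toy data with Year stored as a factor, as in the cleaned dataset
toy <- data.frame(
  Country = c("A", "A", "B", "B"),
  Year    = factor(c(2014, 2015, 2014, 2015)),
  Income  = c(600, 620, 880, 900)
)

# max() is not defined for unordered factors, so convert the labels back to numbers first
latest_year <- max(as.numeric(as.character(toy$Year)))

# Comparing a factor with a number compares against the level labels
toy_latest <- toy[toy$Year == latest_year, ]

print(latest_year)
print(toy_latest)
```

Deriving the latest year from the data, rather than hard-coding it, keeps the filter correct if the dataset is later extended.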

Income per Region in 2015

Figure 1 below shows that Europe & Central Asia has the highest median income, at around $25,000. On the other hand, the South Asia and Sub-Saharan Africa regions have the lowest GDP per capita. We can also see that the interquartile range (IQR) for Middle East & North Africa is much larger than for the other regions. This is likely because of the large differences in GDP per capita among Middle Eastern and North African countries.

#Plot boxplot 
ggplot(data2015, aes(Region, Income, fill = Region)) +
  geom_boxplot() +
  labs(title = "Figure 1: GDP per Capita by Region in 2015", x = "Region", y = "Income(GDP per Capita)") +
  scale_y_continuous(labels = scales::comma, breaks = seq(0, 150000, 25000)) +
  theme(panel.background = element_blank(), panel.grid = element_blank(),
        legend.position = "none", axis.text.x = element_text(angle = 10, vjust = 0),
        axis.title.x = element_blank())
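
The medians and interquartile ranges that a boxplot draws can also be computed as numbers, which makes it easier to back up statements about them. This is a minimal sketch on made-up values; the region names R1 and R2 are placeholders, not Gapminder regions.

```r
# Hypothetical toy data: two regions with three countries each
toy2015 <- data.frame(
  Region = c("R1", "R1", "R1", "R2", "R2", "R2"),
  Income = c(1000, 2000, 3000, 5000, 20000, 80000)
)

# Median and interquartile range of income within each region
medians <- tapply(toy2015$Income, toy2015$Region, median)
iqrs    <- tapply(toy2015$Income, toy2015$Region, IQR)

region_stats <- data.frame(
  Region = names(medians),
  Median = as.vector(medians),
  IQR    = as.vector(iqrs)
)

print(region_stats)
```

With the real data, the same calls on data2015$Income and data2015$Region would need na.rm = TRUE, since some countries have no income value.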

Correlation between Income and Population by Country and Region in 2015

Figure 2 shows a treemap of income and population by country and region. The size of each rectangle represents the population of the country, while its color represents the income. Figure 2 shows that income in Europe & Central Asia and America is much higher than in other regions, while the Sub-Saharan Africa region has the lowest income. We can also see that approximately 25 countries in the East Asia & Pacific and South Asia regions represent about 50% of the global population. The majority of the population in those regions is accounted for by India and China. Figure 2 also shows that countries with smaller populations usually have higher GDP per capita. This is not very surprising, considering that GDP per capita is GDP divided by the population of the country.

#Plot treemap
treemap(data2015,
        index=c("Region", "Country"),
        vSize="Population",
        vColor="Income",
        type="value",
        format.legend = list(scientific = FALSE, big.mark = " "),
        title = "Figure 2: Treemap of Population and Income by Country and Region",
        overlap.labels = 0.5,
        border.col = "black",
        palette="RdBu")
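
Statements about population shares can be checked numerically alongside the treemap. The sketch below uses made-up populations and placeholder names; it computes each region's share of the total and the cumulative share of the largest countries.

```r
# Hypothetical toy data: four countries in two regions
toy2015 <- data.frame(
  Region     = c("R1", "R1", "R2", "R2"),
  Country    = c("A", "B", "C", "D"),
  Population = c(600, 200, 150, 50)
)

# Share of the total population held by each region
region_pop   <- tapply(toy2015$Population, toy2015$Region, sum)
region_share <- region_pop / sum(region_pop)

# Countries ranked by population, with the cumulative share of the total
ranked <- toy2015[order(-toy2015$Population), ]
ranked$CumShare <- cumsum(ranked$Population) / sum(ranked$Population)

print(region_share)
print(ranked)
```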

Correlation between Life Expectancy and Income by Country and Region in 2015

Figure 3 below shows an interactive scatter plot of life expectancy and income by country and region. It indicates that up to a life expectancy of about 70 years, income does not seem to have much effect. Above 70 years, however, the countries with higher GDP per capita have much higher life expectancy. This suggests that people living in countries with higher income have a better chance of living beyond the age of 70. The graph also shows that countries in the Sub-Saharan Africa region have the lowest life expectancy and are clustered close to each other. On the other hand, countries in the Europe & Central Asia and America regions have the highest life expectancy.

#Plot scatterplot
g1 <- ggplot(data2015,aes(LifeExpectancy,Income,group=Country,col=Region))+
  geom_point()+
  theme_classic()+
  labs(title="Figure 3: Scatterplot of Life Expectancy and Income by Country and Region")+
  theme(legend.title = element_blank(), panel.background = element_blank(), panel.grid = element_blank())+
  scale_y_continuous(labels=scales::comma, breaks = c(25000,50000,75000,100000))

plotly::ggplotly(g1)
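
The visual impression from the scatterplot can be complemented with a correlation coefficient. This is a sketch on made-up values, not a result from the Gapminder data. Because income is heavily skewed, it uses the rank-based Spearman correlation and, as an alternative, the Pearson correlation on log-transformed income; both choices are my assumptions rather than part of the original analysis.

```r
# Hypothetical toy data: life expectancy rises with income, but not linearly
toy2015 <- data.frame(
  LifeExpectancy = c(55, 60, 66, 71, 76, 82),
  Income         = c(900, 1500, 4000, 9000, 30000, 60000)
)

# Rank-based correlation, which is not affected by the skew in income
rho <- cor(toy2015$LifeExpectancy, toy2015$Income, method = "spearman")

# Pearson correlation after putting income on a log scale
r_log <- cor(toy2015$LifeExpectancy, log10(toy2015$Income))

print(rho)
print(r_log)
```

On the real data, adding use = "complete.obs" to cor() would drop the countries with missing income before computing the coefficient.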

Correlation between Population and Life Expectancy

The plot of population against life expectancy did not indicate any strong correlation or any other interesting result and is therefore not included.

Conclusion

In conclusion, the exploratory data analysis of the dataset revealed that the quality of the data is very good. A few countries are missing data for many years, but they represent a very small proportion of the world population. The visualizations revealed that the Europe & Central Asia and America regions had the highest median GDP per capita in 2015, while the South Asia and Sub-Saharan Africa regions had the lowest. The data also showed some correlation between population and income, as countries with smaller populations usually have higher GDP per capita than countries with very large populations. There is also a strong correlation between income and life expectancy: the countries with higher income have a life expectancy of 70 years or above.

References

  1. Gapminder Wikipedia: https://en.wikipedia.org/wiki/Gapminder_Foundation
  2. Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse
  3. Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.
  4. Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K. (2018). cluster: Cluster Analysis Basics and Extensions. R package version 2.0.7-1.
  5. Hadley Wickham (2018). scales: Scale Functions for Visualization. R package version 1.0.0. https://CRAN.R-project.org/package=scales
  6. Carson Sievert (2018). plotly for R. https://plotly-r.com
  7. Martijn Tennekes (2017). treemap: Treemap Visualization. R package version 2.4-2. https://CRAN.R-project.org/package=treemap
  8. Baptiste Auguie (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.3. https://CRAN.R-project.org/package=gridExtra
  9. Hao Zhu (2019). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.1.0. https://CRAN.R-project.org/package=kableExtra
  10. Peng, R. (2017). The Art of Data Science. Chapter 4: Exploratory Data Analysis https://bookdown.org/rdpeng/artofdatascience/